Hello out there! The goal of this project is to clean, analyze, and model the data from the NYPD Shooting Incident Data Report, which contains all occurrences from 2006 to 2023.
Data Source: https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nypd_raw <- read_csv("https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv")
## Rows: 28562 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): OCCUR_DATE, BORO, LOC_OF_OCCUR_DESC, LOC_CLASSFCTN_DESC, LOCATION...
## dbl (7): INCIDENT_KEY, PRECINCT, JURISDICTION_CODE, X_COORD_CD, Y_COORD_CD...
## lgl (1): STATISTICAL_MURDER_FLAG
## time (1): OCCUR_TIME
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(nypd_raw)
summary(nypd_raw)
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME BORO
## Min. : 9953245 Length:28562 Length:28562 Length:28562
## 1st Qu.: 65439914 Class :character Class1:hms Class :character
## Median : 92711254 Mode :character Class2:difftime Mode :character
## Mean :127405824 Mode :numeric
## 3rd Qu.:203131993
## Max. :279758069
##
## LOC_OF_OCCUR_DESC PRECINCT JURISDICTION_CODE LOC_CLASSFCTN_DESC
## Length:28562 Min. : 1.0 Min. :0.0000 Length:28562
## Class :character 1st Qu.: 44.0 1st Qu.:0.0000 Class :character
## Mode :character Median : 67.0 Median :0.0000 Mode :character
## Mean : 65.5 Mean :0.3219
## 3rd Qu.: 81.0 3rd Qu.:0.0000
## Max. :123.0 Max. :2.0000
## NA's :2
## LOCATION_DESC STATISTICAL_MURDER_FLAG PERP_AGE_GROUP
## Length:28562 Mode :logical Length:28562
## Class :character FALSE:23036 Class :character
## Mode :character TRUE :5526 Mode :character
##
##
##
##
## PERP_SEX PERP_RACE VIC_AGE_GROUP VIC_SEX
## Length:28562 Length:28562 Length:28562 Length:28562
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## VIC_RACE X_COORD_CD Y_COORD_CD Latitude
## Length:28562 Min. : 914928 Min. :125757 Min. :40.51
## Class :character 1st Qu.:1000068 1st Qu.:182912 1st Qu.:40.67
## Mode :character Median :1007772 Median :194901 Median :40.70
## Mean :1009424 Mean :208380 Mean :40.74
## 3rd Qu.:1016807 3rd Qu.:239814 3rd Qu.:40.82
## Max. :1066815 Max. :271128 Max. :40.91
## NA's :59
## Longitude Lon_Lat
## Min. :-74.25 Length:28562
## 1st Qu.:-73.94 Class :character
## Median :-73.92 Mode :character
## Mean :-73.91
## 3rd Qu.:-73.88
## Max. :-73.70
## NA's :59
colSums(is.na(nypd_raw))
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME
## 0 0 0
## BORO LOC_OF_OCCUR_DESC PRECINCT
## 0 25596 0
## JURISDICTION_CODE LOC_CLASSFCTN_DESC LOCATION_DESC
## 2 25596 14977
## STATISTICAL_MURDER_FLAG PERP_AGE_GROUP PERP_SEX
## 0 9344 9310
## PERP_RACE VIC_AGE_GROUP VIC_SEX
## 9310 0 0
## VIC_RACE X_COORD_CD Y_COORD_CD
## 0 0 0
## Latitude Longitude Lon_Lat
## 59 59 59
nypd = subset(nypd_raw, select=-c(X_COORD_CD, Y_COORD_CD, Lon_Lat))
nypd
library(dplyr)
nypd <- distinct(nypd)
nypd <- nypd %>% replace(.== "NULL", "UNKNOWN")
nypd <- nypd %>% replace(.== "(null)", "UNKNOWN")
# nypd <- nypd %>% replace(.== "UNKNOWN", NA)
unique(nypd$JURISDICTION_CODE)
## [1] 0 2 1 NA
nypd <- nypd %>% drop_na(JURISDICTION_CODE)
nypd <- nypd %>% drop_na(Latitude)
nypd <- nypd %>% drop_na(OCCUR_DATE)
nypd <- nypd %>% drop_na(OCCUR_TIME)
nypd[is.na(nypd)] <- "UNKNOWN"
nypd
all_cols <- lapply(nypd, unique)
unique_cols <- lengths(all_cols)
unique_cols
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME
## 22349 6093 1423
## BORO LOC_OF_OCCUR_DESC PRECINCT
## 5 3 77
## JURISDICTION_CODE LOC_CLASSFCTN_DESC LOCATION_DESC
## 3 10 40
## STATISTICAL_MURDER_FLAG PERP_AGE_GROUP PERP_SEX
## 2 9 4
## PERP_RACE VIC_AGE_GROUP VIC_SEX
## 7 7 3
## VIC_RACE Latitude Longitude
## 7 13385 13373
Here we are performing some more data cleaning and using those cleaned variables to create some visualizations that tell a story about the data overall.
unique(nypd$VIC_AGE_GROUP)
## [1] "25-44" "18-24" "45-64" "65+" "<18" "UNKNOWN" "1022"
nypd <- nypd[!(nypd$VIC_AGE_GROUP %in% "1022"), ]
library(ggplot2)
ggplot(data=nypd, aes(x=VIC_AGE_GROUP)) +
geom_bar(fill = 'lightblue') +
ggtitle("Victim Age Groups") +
xlab("Age Group") +
ylab("Total") + theme_minimal()
nypd <- nypd[!(nypd$PERP_AGE_GROUP %in% c("940", "224", "1020")), ]
nypd$PERP_SEX[nypd$PERP_SEX == "U"] <- "UNKNOWN"
ggplot(data=nypd, aes(x=PERP_AGE_GROUP)) +
geom_bar(fill = 'purple') +
ggtitle("Perpetrator Age Groups") +
xlab("Age Group") +
ylab("Total") + theme_minimal()
It looks like the majority of perpetrators do not belong to a specific age group. But the rest are similar in age to their victims, 18-24 and 25-44.
ggplot(nypd, aes(x=Latitude, y=Longitude)) +
geom_point(aes(color=factor(BORO))) +
ggtitle("Location of Occurrence by Borough") +
labs("Borough") + theme_minimal()
ggplot(data=nypd, aes(x=BORO)) +
geom_bar(fill = 'pink') +
ggtitle("Occurrences by Borough") +
xlab("Borough") +
ylab("Total") + theme_minimal()
The Bronx and Brooklyn have more cases than the rest of the boroughs.
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
tabyl(nypd, BORO, VIC_AGE_GROUP)
ggplot(data=nypd, aes(x=STATISTICAL_MURDER_FLAG)) + geom_bar(fill = 'blue') +
ggtitle("Statistical Murder Flags") +
xlab("Murder Flagged") +
ylab("Total") + theme_minimal()
Here we can see that there is a large imbalance between occurrences flagged as murder and those that are not flagged. The majority of events are not flagged.
tabyl(nypd, PERP_AGE_GROUP, PERP_RACE, PERP_SEX)
## $F
## PERP_AGE_GROUP AMERICAN INDIAN/ALASKAN NATIVE ASIAN / PACIFIC ISLANDER BLACK
## <18 0 0 26
## 18-24 1 1 92
## 25-44 0 4 144
## 45-64 0 0 20
## 65+ 0 0 0
## UNKNOWN 0 0 10
## BLACK HISPANIC UNKNOWN WHITE WHITE HISPANIC
## 2 0 1 10
## 17 1 5 43
## 7 0 12 32
## 1 0 1 4
## 1 0 0 0
## 2 4 0 2
##
## $M
## PERP_AGE_GROUP AMERICAN INDIAN/ALASKAN NATIVE ASIAN / PACIFIC ISLANDER BLACK
## <18 0 14 1215
## 18-24 0 48 4479
## 25-44 1 81 4186
## 45-64 0 10 427
## 65+ 0 0 30
## UNKNOWN 0 11 1231
## BLACK HISPANIC UNKNOWN WHITE WHITE HISPANIC
## 144 14 9 235
## 606 56 47 1012
## 464 56 136 902
## 47 5 58 124
## 4 1 18 11
## 91 221 11 125
##
## $UNKNOWN
## PERP_AGE_GROUP AMERICAN INDIAN/ALASKAN NATIVE ASIAN / PACIFIC ISLANDER BLACK
## <18 0 0 1
## 18-24 0 0 7
## 25-44 0 0 3
## 45-64 0 0 0
## 65+ 0 0 0
## UNKNOWN 0 0 7
## BLACK HISPANIC UNKNOWN WHITE WHITE HISPANIC
## 0 2 0 0
## 2 7 0 0
## 0 4 0 0
## 0 0 0 0
## 0 0 0 0
## 0 11891 0 0
# library(vtree)
# vtree(nypd, c("PERP_AGE_GROUP", "PERP_SEX", "PERP_RACE"), horiz = TRUE, fillcolor = c(PERP_AGE_GROUP = "#e7d4e8", PERP_SEX = "#99d8c9", PERP_RACE = "#9ecae1"), keep = list(PERP_SEX = c("M", "F", "UNKNOWN")), showcount = FALSE)
# vtree(nypd, c("BORO", "LOC_OF_OCCUR_DESC"), fillcolor = c(BORO="#e7d4e8", LOC_OF_OCCUR_DESC="#99d8c9"), horiz=TRUE)
# nypd[["OCCUR_DATE"]] <- as.POSIXct(nypd[["OCCUR_DATE"]], format ="%m-%d-%Y")
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
library(CGPfunctions)
ggplot(data=nypd, aes(x=PRECINCT)) +
geom_bar(fill = 'brown') +
ggtitle("Occurrences by Precinct") +
xlab("Precinct Number") +
ylab("Total") + theme_minimal()
# PlotXTabs(nypd, LOC_OF_OCCUR_DESC, BORO)
# PlotXTabs(nypd, PERP_SEX, PERP_AGE_GROUP)
# PlotXTabs(nypd, VIC_SEX, VIC_AGE_GROUP)
library(lubridate)
nypd_date <- nypd %>% mutate(date=mdy(OCCUR_DATE))
nypd_date
nypd_total <- sqldf('SELECT date, COUNT(date) as total FROM nypd_date GROUP BY date ORDER BY date')
nypd_total
nypd_cumsum_total <- nypd_total %>% mutate(cum_total=cumsum(total))
nypd_cumsum_total <- subset(nypd_cumsum_total, select=-c(total))
nypd_cumsum_total
The two graphs below are interactive and show that cases have steadily increased (not too fast!) over the years but there is a slight dip in June 2020 followed by an increase in July 2020.
library(TSstudio)
ts_plot(nypd_cumsum_total,
title = "NYPD Historic Shooting Data Cumulative Totals 2006-2023",
Xtitle = "Year",
Ytitle = "Cumulative Total Events",
slider = TRUE)
ts_plot(nypd_total,
title = "NYPD Historic Shooting Data Daily Totals",
Xtitle = "Date",
Ytitle = "Total",
slider = TRUE)
After converting our feature and target variables using one-hot encoding methods, we’ll be splitting the data 80-20% for training and testing.
If the logistic regression model gives an installation error, try install.packages(‘glmnet’, dependencies=TRUE, type=“binary”).
library(datasets)
library(caTools)
library(party)
## Loading required package: grid
## Loading required package: mvtnorm
## Loading required package: modeltools
## Loading required package: stats4
## Loading required package: strucchange
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
##
## Attaching package: 'strucchange'
## The following object is masked from 'package:stringr':
##
## boundary
##
## Attaching package: 'party'
## The following object is masked from 'package:dplyr':
##
## where
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.6 ✔ rsample 1.2.1
## ✔ dials 1.3.0 ✔ tune 1.2.1
## ✔ infer 1.0.7 ✔ workflows 1.1.4
## ✔ modeldata 1.4.0 ✔ workflowsets 1.1.0
## ✔ parsnip 1.2.1 ✔ yardstick 1.3.1
## ✔ recipes 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ magrittr::extract() masks tidyr::extract()
## ✖ dplyr::filter() masks stats::filter()
## ✖ parsnip::fit() masks infer::fit(), party::fit(), modeltools::fit()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tune::parameters() masks dials::parameters(), modeltools::parameters()
## ✖ magrittr::set_names() masks purrr::set_names()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## ✖ recipes::update() masks stats4::update(), stats::update()
## ✖ party::where() masks dplyr::where()
## • Dig deeper into tidy modeling with R at https://www.tmwr.org
library(readr)
nypd_final <- subset(nypd_date,
select=-c(INCIDENT_KEY,
OCCUR_DATE,
Longitude,
Latitude,
OCCUR_TIME,
date))
nypd_final
tar_var <- nypd_final[["STATISTICAL_MURDER_FLAG"]]
nypd_final <- subset(nypd_final, select=-c(STATISTICAL_MURDER_FLAG))
nypd_final
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
##
## lift
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:zoo':
##
## yearmon, yearqtr
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
dummy <- dummyVars(" ~ .", data=nypd_final)
nypd_new <- data.frame(predict(dummy, newdata = nypd_final))
nypd_new$STATISTICAL_MURDER_FLAG = c(tar_var)
nypd_new$STATISTICAL_MURDER_FLAG <- as.factor(as.logical(nypd_new$STATISTICAL_MURDER_FLAG))
nypd_new
set.seed(42)
nypd_split <- sample.split(nypd_new, SplitRatio = 0.8)
train_data <- subset(nypd_new, nypd_split == TRUE)
test_data <- subset(nypd_new, nypd_split == FALSE)
rmodel <- logistic_reg(mixture = double(1), penalty = double(1)) %>% set_engine("glmnet") %>% set_mode("classification") %>% fit(STATISTICAL_MURDER_FLAG ~ ., data = train_data)
tidy(rmodel)
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
pred_class <- predict(rmodel, new_data = test_data, type = 'class')
pred_proba <- predict(rmodel, new_data = test_data, type = 'prob')
results <- test_data %>% select(STATISTICAL_MURDER_FLAG) %>% bind_cols(pred_class, pred_proba)
accuracy(results, truth = STATISTICAL_MURDER_FLAG, estimate = .pred_class)
conf_mat(results, truth = STATISTICAL_MURDER_FLAG, estimate = .pred_class)
## Truth
## Prediction FALSE TRUE
## FALSE 4717 1074
## TRUE 17 19
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
## The following objects are masked from 'package:yardstick':
##
## accuracy, mae, mape, mase, precision, recall, rmse, smape
results$gtruth <- as.integer(as.logical(results$STATISTICAL_MURDER_FLAG))
results$pred_class <- as.integer(as.logical(results$.pred_class))
precision(results$gtruth, results$pred_class)
## [1] 0.5277778
accuracy(results$gtruth, results$pred_class)
## [1] 0.8127681
recall(results$gtruth, results$pred_class)
## [1] 0.01738335
auc(results$gtruth, results$pred_class)
## [1] 0.5068962
rmse(results$gtruth, results$pred_class)
## [1] 0.432703
When it comes to data cleaning and manipulation, regardless of the topic, biases always find a way to influence our decisions when it comes to how or why we perform certain procedures. When I was looking at this data set, I noticed that the majority of shooting events reported were from people ages 18-44 with the reminder being under 18 and 45+. This leaves teenagers under 18 and the elderly susceptible to bias when building the model since there isn’t a lot of data available for them, making it easier for analysts and scientists to overlook their importance. The same thing can be said for perpetrators although their bias comes from lack of information since the majority of them do not have an assigned age group. In addition, I noticed that the majority of crimes were committed by males, which can easily make any future models biased against males and more lenient towards females. But when it comes to crimes reported, most of those are also from males, making females an incredibly underrepresented group. This could lead to all sorts of problems for models since the odds are overwhelmingly in favor of one gender over another. It’s fair to say that most of these biases are from gaps in data and other unknown factors. In many cases, we cannot go back in time to retrieve any lost data so it’s important to look at what’s available with impartiality to ensure the best possible outcome.
results